Skip to content

fix(hot): split history shards by encoded size, not count#66

Open
dylanlott wants to merge 1 commit into
mainfrom
dylan/split-shards
Open

fix(hot): split history shards by encoded size, not count#66
dylanlott wants to merge 1 commit into
mainfrom
dylan/split-shards

Conversation

@dylanlott
Copy link
Copy Markdown

@dylanlott dylanlott commented May 20, 2026

Closes ENG-2287

AccountsHistory and StorageHistory are stored DUPSORT, so each dup
value (key2 || encoded BlockNumberList) is capped at MDBX's DUPSORT
value limit (~1980 B on 4 KB pages). The previous splitter gated on
ShardedKey::SHARD_COUNT (2000), but a roaring BlockNumberList of 2000
sparse indices serialises to >20 KB. Once a hot account's shard tipped
over, every live block touching it crashed the sidecar with
MDBX_BAD_VALSIZE.

Replace count-based splitting in append_to_sharded_history with a
size-based check against a new ShardedKey::MAX_SHARD_BYTES (1500). When
the merged list overflows the budget, binary-search the largest prefix
that fits and emit shards until the remainder is consumed. Existing
oversized shards self-heal on the next write to them, so no migration
is required.

Add test_history_shard_fits_in_dupsort_limit to the hot conformance
suite. It exercises the real update_history_indices_inconsistent path
with 2000 sparse blocks (one per roaring container, the worst-case
encoding) and asserts the union across shards round-trips.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

Copy link
Copy Markdown
Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

Closes ENG-2287

`AccountsHistory` and `StorageHistory` are stored DUPSORT, so each dup
value (key2 || encoded BlockNumberList) is capped at MDBX's DUPSORT
value limit (~1980 B on 4 KB pages). The previous splitter gated on
ShardedKey::SHARD_COUNT (2000), but a roaring BlockNumberList of 2000
sparse indices serialises to >20 KB. Once a hot account's shard tipped
over, every live block touching it crashed the sidecar with
MDBX_BAD_VALSIZE.

Replace count-based splitting in append_to_sharded_history with a
size-based check against a new ShardedKey::MAX_SHARD_BYTES (1500). When
the merged list overflows the budget, binary-search the largest prefix
that fits and emit shards until the remainder is consumed. Existing
oversized shards self-heal on the next write to them, so no migration
is required.

Add test_history_shard_fits_in_dupsort_limit to the hot conformance
suite. It exercises the real update_history_indices_inconsistent path
with 2000 sparse blocks (one per roaring container, the worst-case
encoding) and asserts the union across shards round-trips.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dylanlott dylanlott force-pushed the dylan/split-shards branch from cf04073 to c512dff Compare May 20, 2026 06:11
@dylanlott dylanlott requested a review from prestwich May 20, 2026 06:20
@dylanlott dylanlott marked this pull request as ready for review May 20, 2026 06:21
/// Soft cap on the number of indices in one shard.
///
/// This is a sanity ceiling used alongside [`Self::MAX_SHARD_BYTES`];
/// shard splitting is driven by encoded size, not by this count.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing we should consider is whether we can empirically find size bounds with best and worst case inputs

Copy link
Copy Markdown
Member

@prestwich prestwich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are writing 3 cases:

  1. exceeds MAX_SHARD_BYTES
  2. exceeds MAX_SHARD_BYTES and the MDBX dupsort max
  3. (boring case) does not exceed MAX_SHARD_BYTES

previously:

  • case 2 would panic during writing
  • case 1 & 3 would work

there are no changes to the decoding, so anything written with previous versions can still decode. Different versions would chunk history differently, and therefore have slightly different real-world perf, but there should not be any version compatibility issues, as the history chunking is not consensus critical

  1. How does the treemap encode the indicies?
  2. Can we empirically determine a safe number of indices based on the encoding so that we don't have to do a binary search?
  3. In pursuit of 2, let's add tests to the storage-types package with worst-case indices.
  4. Let's promote the SHARD_COUNT to a type-associated constant on our BlockNumberList

@prestwich
Copy link
Copy Markdown
Member

[Claude Code]

Posting on James's behalf — follow-up to the size-based-splitting discussion.

Agree the size-based approach is the right call (the count-based alternative imposes ~10× metadata overhead on hot, dense addresses, which dominate real workloads). The objection is to the binary search, not the size budget.

Why the current splitter is more expensive than it needs to be

max_prefix_fitting does up to log₂(n) probes per shard, and each probe calls BlockNumberList::new_pre_sorted(&indices[..mid]) + serialized_size() — i.e., constructs a fresh RoaringTreemap from scratch per iteration. For a 2000-entry oversized merge that splits into ~13 shards, that's ~140 full roaring constructions on the overflow path. Not catastrophic (overflow is rare), but unnecessary.

Also, the if mid == 1 { break; } branch silently returns best = 1 (the initial value, never probed) without verifying that a 1-element list actually fits. Theoretical footgun.

Three cheaper heuristics

A. Incremental insert + exact size check (simplest). Insert indices one at a time into a working BlockNumberList; query serialized_size() after each. It's O(num_containers) per call, not O(n), so total work is O(n × containers) — for the dense path (1 container) effectively O(n); for the pathological all-distinct-containers path O(n²) ≈ 4M tiny ops, still cheaper than the current binary search's repeated full re-encodings. No probing, exact.

B. Boundary-triggered estimate (O(n)). serialized_size() only jumps when a new container is created or a container transitions representation (array↔bitmap↔run). Both are detectable from the input itself: (idx >> 16) changing (new container) or input crossing density thresholds. Maintain a running upper-bound estimate (treat every container as array: +12 B per new container, +2 B per element); emit when it crosses budget. Conservative — gives up some bitmap/run compression slack on dense paths — so periodically recalibrate with a real serialized_size() to recover it.

C. Segment-based greedy pack (O(n), exact). Pre-segment the merged input by (hi32, hi16) in one pass; each segment's encoded cost is predictable from cardinality. Greedy-pack segments into shards.

Recommendation: Option A

Smallest delta from the current PR. Replace max_prefix_fitting with a single-pass incremental builder:

let mut shard_start = 0;
let mut shard = BlockNumberList::default();
for (i, &idx) in all.iter().enumerate() {
    shard.push(idx)?;
    if shard.serialized_size() > ShardedKey::MAX_SHARD_BYTES {
        // emit prefix [shard_start..i) — rebuild once at the boundary
        let prefix = BlockNumberList::new_pre_sorted(all[shard_start..i].iter().copied());
        write_shard(all[i - 1], &prefix)?;
        shard_start = i;
        shard = BlockNumberList::default();
        shard.push(idx)?;
    }
}
write_shard(u64::MAX, &shard)?;

One roaring rebuild per shard boundary instead of log₂(n) per shard. Exact, no probing, simpler to read.

Orthogonal asks from your earlier review still stand

The four points from the change-request — characterise roaring encoding, add worst-case tests in storage-types, derive a documented safe upper bound, and move encoding-derived constants onto BlockNumberList — are independent of which splitting heuristic we use, and would land cleanly on top of Option A. The worst-case test in signet-storage-types is what would justify the MAX_SHARD_BYTES = 1500 number rigorously (right now it's asserted by comment, not by test).

@prestwich
Copy link
Copy Markdown
Member

[Claude Code]

Follow-up posted on James's behalf — layering observation, separable from the splitting-heuristic discussion above.

MAX_SHARD_BYTES = 1500 currently lives in signet-storage-types on ShardedKey, but that number is a property of MDBX's DUPSORT value cap (~1980 B on 4 KB pages, minus key2 and per-node overhead). Other backends that implement HotKv — MemKv today, hypothetical SQL/remote backends later — have no such limit, and forcing them to split into ~1500-byte shards just wastes work and inflates their on-disk/in-memory layout for no reason.

The dual-keyed table abstraction itself is fine; that's a deliberate primitive in this storage layer, not an MDBX-ism. The issue is purely that the byte budget is backend-specific while the splitting policy is generic.

Minimal cleanup that fixes the layering without touching the trait surface:

trait HotKvWrite {
    /// Maximum encoded size of a single stored value, in bytes.
    /// `None` means unbounded.
    fn max_value_bytes(&self) -> Option<usize> { None }
}
  • signet-hot-mdbx overrides to return Some(1500) (or computes from page size).
  • MemKv keeps the default None and writes one monolithic shard per address.
  • append_to_sharded_history reads self.max_value_bytes(); if None, fast-path emit; otherwise run the size-based splitter against that budget.

This keeps the shared splitter (so we don't duplicate the algorithm per backend), keeps the schema shared (dual-keyed history tables remain the abstraction's primitive), and removes the MDBX-specific constant from signet-storage-types.

BlockNumberList::serialized_size() stays in signet-storage-types — it's a property of the encoding. Only the budget moves.

Happy to file this as its own issue and let ENG-2287 land first, or fold it into the same PR if it's small enough. Probably the former — different scope, easier to review separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants